From Recurrence to Attention: Overcoming the Limits of Sequential Modeling

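Traditional sequence modeling relied heavily on recurrent neural networks (RNNs) and their gated variants (LSTM, GRU). These models were transformative for early sequence-to-sequence tasks, but they run into fundamental scaling problems when handling long-range dependencies. The introduction of the attention mechanism provided the key conceptual breakthrough that overcame these constraints and made modern, effective natural language processing systems possible.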

1. The Long-Range Dependency Problem

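In an RNN, the dependency path between token $t_i$ and token $t_j$ must pass through every intermediate step in turn. During backpropagation, the gradient signal is therefore multiplied repeatedly by the recurrent weight matrix. As a result, the signal typically decays rapidly (the vanishing gradient problem), which makes it nearly impossible to carry useful information or error signals across long distances in the sequence. The path length grows with the distance $|i - j|$, so the path complexity is $O(N)$ for a sequence of length $N$.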
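The decay described above can be illustrated with a minimal NumPy sketch. It assumes a simplified linear recurrence $h_t = W h_{t-1}$ with the nonlinearity omitted, and a weight matrix built so that every singular value equals a chosen `scale` below 1. The function name `gradient_norms` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gradient_norms(num_steps, hidden_size=16, scale=0.9, seed=0):
    """Track the norm of a gradient vector as it is back-propagated through
    num_steps steps of a linear recurrence h_t = W h_{t-1}.

    Assumption: the nonlinearity is omitted and W is a scaled orthogonal
    matrix, so each backward step shrinks the norm by exactly `scale`.
    """
    rng = np.random.default_rng(seed)
    # QR decomposition of a random matrix gives an orthogonal matrix q.
    q, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))
    W = scale * q
    # Unit-norm gradient arriving at the last time step.
    grad = np.ones(hidden_size) / np.sqrt(hidden_size)
    norms = [np.linalg.norm(grad)]
    for _ in range(num_steps):
        grad = W.T @ grad  # one step of backpropagation through time
        norms.append(np.linalg.norm(grad))
    return norms

norms = gradient_norms(num_steps=100)
print(f"after 10 steps: {norms[10]:.3e}, after 100 steps: {norms[100]:.3e}")
```

In this setup the norm after $k$ steps is `scale ** k`, so the signal shrinks geometrically with the path length. A real RNN also multiplies by the derivative of its activation function at each step, which this sketch leaves out.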

2. The Fixed-Size Context Bottleneck

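Architectures that predate the standard encoder-decoder attention mechanism had to compress the entire meaning of the source sequence, regardless of its length, into a single fixed-dimensional vector (the context vector, $C$). This bottleneck severely limits the model's ability to retain all the necessary information, especially for long or complex inputs, and causes significant information loss at the decoding stage.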
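As a rough sketch of this bottleneck, the snippet below runs a plain tanh RNN encoder over a short and a long input and keeps only the final hidden state as $C$. The `encode` function, the weight names `W_h` and `W_x`, and the sizes are assumptions made for illustration, with random weights standing in for a trained model.

```python
import numpy as np

def encode(tokens, W_h, W_x):
    """Run a plain tanh RNN over `tokens` and return only the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:
        h = np.tanh(W_h @ h + W_x @ x)
    return h  # the context vector C

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4
W_h = 0.5 * rng.standard_normal((hidden_size, hidden_size))
W_x = 0.5 * rng.standard_normal((hidden_size, embed_size))

short_input = rng.standard_normal((5, embed_size))    # 5 tokens
long_input = rng.standard_normal((500, embed_size))   # 500 tokens

C_short = encode(short_input, W_h, W_x)
C_long = encode(long_input, W_h, W_x)
print(C_short.shape, C_long.shape)  # both are (8,)
```

Whether the source has 5 tokens or 500, the decoder receives the same number of values, so the information kept per token has to fall as the input grows.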

Conceptual Representation

RNN Context Bottleneck

A visualization illustrating the traditional RNN Encoder-Decoder structure where the sequence is compressed into a single, fixed-size vector before being passed to the decoder. This point of compression often results in the loss of fine-grained information required for accurate long-sequence translation.

[Figure: RNN Encoder-Decoder showing the context vector bottleneck]

Question 1

Why is the dependency path length in a standard RNN considered a major limitation for long sequences?

Path complexity is $O(1)$.

Path complexity is $O(N^2)$.

Path complexity is $O(N)$, causing vanishing gradients.

It prevents the use of LSTMs.

Question 2

In pre-Attention Seq2Seq models, what component represents the 'information bottleneck'?

The softmax layer.

The recurrent cell (e.g., GRU).

The fixed-size context vector derived from the encoder's final hidden state.

The input embedding layer.

Challenge: Conceptualizing Attention's Advantage

Comparing Structural Complexity

Consider a sequence of length $N$. We want to establish a dependency between token $X_i$ and token $Y_j$.

Contrast the dependency path length required by:

Traditional Recurrence (e.g., LSTM)
Attention Mechanism (Query-Key comparison)

Step 1

How does Attention fundamentally reduce the structural complexity of establishing distant dependencies?

Solution:
Attention creates a direct, non-sequential connection between any output token $Y_j$ and any input token $X_i$ by computing a weight from their vector similarity ($Q_j K_i^T$). The dependency path length is therefore $O(1)$: the signal passes through a single weighted connection rather than a chain of intermediate states, which removes the linear path traversal imposed by recurrence ($O(N)$).
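The solution can be sketched in a few lines of NumPy. This is a minimal illustration under some assumptions: it uses unscaled dot-product scores to match the $Q_j K_i^T$ notation above, it introduces a value matrix `V` that the text does not mention, and it uses random vectors in place of learned projections. The function name `attention` and the sizes are hypothetical.

```python
import numpy as np

def attention(Q, K, V):
    """Dot-product attention: every query row attends over all key rows.

    Assumption: scores are left unscaled to mirror Q_j K_i^T in the text.
    """
    scores = Q @ K.T                                      # (M, N): entry [j, i] is Q_j K_i^T
    scores = scores - scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over the N input positions
    return weights @ V, weights

# Hypothetical sizes: N input tokens, M output tokens, d-dimensional vectors.
rng = np.random.default_rng(0)
N, M, d = 50, 3, 8
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
Q = rng.standard_normal((M, d))

context, weights = attention(Q, K, V)
print(context.shape, weights.shape)
```

Row `j` of `weights` assigns a weight to every input position at once, so input `i` reaches output `j` through one matrix product regardless of how far apart they are. The trade-off is that computing all the scores takes work proportional to $M \times N$.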